Daniel Martin, Kameron Standhaft, Esra Ari Acar
Latent Class Analysis (LCA) is a probabilistic method of unsupervised clustering, used when a population is believed to contain unobserved subgroups (classes) (Nylund-Gibson and Choi 2018).
The central assumption of the LCA model is that latent classes exist within the population (Weller, Bowen, and Faubert 2020).
LCA is a type of finite mixture model (FMM), a statistical approach used in unsupervised learning (Grimm, Houpt, and Rodgers 2021).
LCA is a useful model for identifying subgroups within a population based on patterns in categorical variables.
Advantages:
Identifying hidden subgroups
Handling categorical data
Providing probability estimates
Handling missing data
Model selection
Drawbacks:
LCA is computationally demanding, which limits the number of variables in the analysis.
It may be challenging to determine the appropriate number of latent classes.
LCA has been used in various fields, such as psychology, sociology, public health, and business and marketing research.
The Application of Latent Class Analysis for Investigating Population Child Mental Health (Petersen, Qualter, and Humphrey 2019)
Latent Class Analysis for Marketing Scale Development (Bassi 2011)
The standard equation (Naldi and Cazzaniga 2020) for the LCA model is:
\[ p(x_i)=\sum_{k = 1}^{K}{p_k}\prod_{n = 1}^{N}{p_n(x_{in}|k)},\]
\(p(x_i)\): the probability of observing a particular combination of responses in a group of \(N\) variables
\(p_k\): the probability of membership in latent class \(k\)
\(p_n(x_{in}|k)\): the probability of the observed response to variable \(n\), conditional on membership in latent class \(k\)
Two kinds of parameters in an LCA model:
Inclusion probabilities (\(p_k\))
Conditional probabilities (\(p_n(x_{in}|k)\))
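As a minimal sketch of how the formula above combines its two kinds of parameters, the R snippet below evaluates \(p(x_i)\) for a single response pattern. The two classes, three binary variables, and all probability values are made-up numbers for illustration, and `pattern_prob` is a hypothetical helper, not part of any package.

```r
# Toy LCA parameters (hypothetical values, for illustration only)
p_k <- c(0.6, 0.4)                 # inclusion probabilities, K = 2 classes
# p_yes[k, n]: probability of a "yes" (1) on binary variable n within class k
p_yes <- rbind(c(0.9, 0.8, 0.1),
               c(0.2, 0.3, 0.7))

# Probability of one response pattern x (a vector of 0/1 values)
pattern_prob <- function(x, p_k, p_yes) {
  # Within each class, multiply the per-variable response probabilities
  cond <- apply(p_yes, 1, function(p) prod(ifelse(x == 1, p, 1 - p)))
  # Weight by the inclusion probabilities and sum over classes
  sum(p_k * cond)
}

x <- c(1, 1, 0)
p_x <- pattern_prob(x, p_k, p_yes)
p_x   # 0.6 * (0.9 * 0.8 * 0.9) + 0.4 * (0.2 * 0.3 * 0.3) = 0.396

# Sanity check: the probabilities of all 2^3 patterns sum to 1
patterns <- as.matrix(expand.grid(0:1, 0:1, 0:1))
total <- sum(apply(patterns, 1, pattern_prob, p_k = p_k, p_yes = p_yes))
total
```

The product inside each class reflects the conditional independence of the variables given class membership, which is what lets the model describe a joint distribution with so few parameters.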
The general steps to use latent class analysis are as follows:
Identify the research question and define the variables
Determine the number of latent classes for an ideal model
Estimate the model: Estimate the model parameters using maximum likelihood estimation or Bayesian estimation
Evaluate the model: Assess fit using goodness-of-fit statistics (the likelihood ratio test, Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), entropy, and the chi-square goodness-of-fit statistic \((\chi^2)\))
Conduct sensitivity analysis: Test the robustness of the results (e.g. with a correlation test) and evaluate the stability of the estimated probabilities across different subgroups or samples.
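To make the evaluation step concrete, here is a small sketch of how AIC and BIC follow from a model's log-likelihood, using the standard definitions AIC = -2 ln L + 2p and BIC = -2 ln L + p ln(n). The helper names `lca_npar` and `info_criteria` and the log-likelihood value are assumptions for illustration; only the counts (5 classes, 14 binary variables, 101 entries) come from this analysis.

```r
# Number of free parameters in an LCA model:
# (K - 1) inclusion probabilities plus, for each class,
# (categories - 1) conditional probabilities per variable
lca_npar <- function(n_classes, n_categories) {
  (n_classes - 1) + n_classes * sum(n_categories - 1)
}

# AIC and BIC from a maximized log-likelihood
info_criteria <- function(loglik, npar, n_obs) {
  c(AIC = -2 * loglik + 2 * npar,
    BIC = -2 * loglik + npar * log(n_obs))
}

npar5 <- lca_npar(n_classes = 5, n_categories = rep(2, 14))
npar5   # 4 + 5 * 14 = 74

# Hypothetical log-likelihood, NOT a result from the zoo data
ic <- info_criteria(loglik = -500, npar = npar5, n_obs = 101)
ic
```

Both criteria penalize extra parameters, and BIC penalizes them more heavily once n exceeds about 7, so it tends to favor models with fewer classes.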
To assess the effectiveness of Latent Class Analysis, we used the ‘Zoo Animals’ dataset from Kaggle. The dataset has 101 entries and suits LCA well because most of its variables are binary.
Dropped
We dropped two variables (Catsize and Domestic) because we considered them subjective, and we dropped the ‘Animal’ variable because it is an identifier.
Changed
One variable in our dataset was discrete rather than binary, so we converted it to a binary variable.
To perform our analysis, we are using these 14 variables.
We are not including the ‘type’ variable in our analysis. Instead, we use its categories (mammal, bird, reptile, fish, insects, amphibian, invertebrate) as a comparison for the classes produced by our Latent Class Analysis.
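The preparation steps described above can be sketched on a tiny stand-in data frame. Everything here is hypothetical: `toy_zoo` is not the real dataset, and the assumption that the discrete variable is a leg count named `legs` (recoded to `has_legs`) is ours. The final step reflects poLCA's requirement that categories be coded as positive integers starting at 1.

```r
# Hypothetical three-row stand-in for the zoo data (not the real dataset)
toy_zoo <- data.frame(
  animal = c("bear", "duck", "bass"),
  hair   = c(1, 0, 0),
  legs   = c(4, 2, 0),
  type   = c("mammal", "bird", "fish")
)

# Collapse the discrete leg count into a 0/1 indicator
toy_zoo$has_legs <- as.integer(toy_zoo$legs > 0)

# Keep only the indicators: drop the identifier, the original count,
# and the held-out 'type' label
lca_input <- toy_zoo[, c("hair", "has_legs")]

# poLCA expects categories coded 1, 2, ... so shift 0/1 to 1/2
lca_input[] <- lapply(lca_input, function(v) as.integer(v) + 1L)
lca_input
```

Keeping `type` out of the model input is what allows it to serve as an independent check on the recovered classes later.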
In order to determine the optimal number of classes that would best model our dataset, we conducted a series of analyses beginning with 2 classes and concluding at 7 classes, which was the number of different types of animals in the dataset. The code presented below was utilized to fit the model for each of the different number of classes.
The poLCA function has the following options:
nclass
graphs
na.rm
nrep
maxiter
verbose
The function estimates the “latent class prevalence”, which is the estimated share of the population in each class, i.e. the probability that a dataset entry belongs to a given model-generated class (Law and Harrington 2016).
Entropy is a measure of the dispersion (or concentration) in a probability mass function (R Core Team 2021).
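A hedged sketch of the fitting loop described above, run on synthetic data rather than the zoo dataset. The data frame `sim`, the item names, and the choice of 2 to 4 classes are assumptions made so the example is self-contained; the poLCA arguments are the options listed above.

```r
library(poLCA)

set.seed(1)
# Synthetic data: two hypothetical groups that differ in their "yes" rate
n <- 200
group <- rep(1:2, each = n / 2)
p_yes <- ifelse(group == 1, 0.9, 0.1)
# Five binary items, coded 1/2 because poLCA needs positive integer codes
sim <- as.data.frame(replicate(5, rbinom(n, 1, p_yes) + 1L))
names(sim) <- paste0("item", 1:5)

f <- cbind(item1, item2, item3, item4, item5) ~ 1

# Fit one model per candidate number of classes
n_classes <- 2:4
fits <- lapply(n_classes, function(k)
  poLCA(f, data = sim, nclass = k, graphs = FALSE, na.rm = TRUE,
        nrep = 5, maxiter = 1000, verbose = FALSE))

# Collect the fit statistics side by side
fit_table <- data.frame(
  nclass = n_classes,
  AIC = sapply(fits, function(m) m$aic),
  BIC = sapply(fits, function(m) m$bic)
)
fit_table

# Estimated latent class prevalences for the 2-class model
fits[[1]]$P
```

Using `nrep` greater than 1 reruns the estimation from several random starting points, which reduces the chance of reporting a local maximum of the likelihood.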
We use the poLCA.entropy() function to compare our 5-Class and 6-Class models.
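For intuition, the sketch below computes Shannon entropy for two small probability distributions in base R. `shannon_entropy` is a hypothetical helper written for this illustration; it is not claimed to reproduce the internals of poLCA.entropy().

```r
# Shannon entropy (natural log) of a discrete probability distribution
shannon_entropy <- function(p) {
  stopifnot(abs(sum(p) - 1) < 1e-8)
  p <- p[p > 0]            # treat 0 * log(0) as 0
  -sum(p * log(p))
}

# Evenly spread probability: maximum entropy for four outcomes, log(4)
h_uniform <- shannon_entropy(rep(0.25, 4))

# Probability concentrated on one outcome: much lower entropy
h_peaked <- shannon_entropy(c(0.97, 0.01, 0.01, 0.01))

c(uniform = h_uniform, peaked = h_peaked)
```

Lower entropy means the probability mass is concentrated on fewer outcomes, which is why the statistic can be read as a measure of concentration.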
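# Reorder the classes from highest to lowest estimated proportion for easier comparison and labeling.
# lca_fit5 must already hold an initial 5-class fit; its starting values are reordered and reused here.
# Note: when nrep > 1, poLCA applies user-supplied probs.start only to the first run.
probs.start.new <- poLCA.reorder(lca_fit5$probs.start,
                                 order(lca_fit5$P, decreasing = TRUE))
lca_fit5 <- poLCA(lca_bind, data = new_zoo_int_1, nclass = 5,
                  graphs = FALSE, na.rm = TRUE,
                  verbose = FALSE, nrep = 100, maxiter = 100,
                  probs.start = probs.start.new)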
orig_classes <- data.frame(
  Class = c('Class 1', 'Class 2', 'Class 3', 'Class 4', 'Class 5'),
  Percentage = c('35.46%', '20.78%', '17.82%', '13.09%', '12.85%'))
Class 1, the largest class, is estimated to contain ~35.46% of the population, and its members are predicted to have ‘hair’, ‘toothed’, ‘backbone’, ‘breathes’, ‘tail’, ‘has_legs’. We filtered the cleaned new_zoo dataset for these attributes, and noted that all members were classified as type ‘mammal’ in the original dataset.
Class 2 is estimated to contain ~20.78% of the population, and its members are predicted to have ‘feathers’, ‘eggs’, ‘backbone’, ‘breathes’, ‘tail’, ‘has_legs’. We filtered the cleaned new_zoo dataset for these attributes, and noted that all members were classified as type ‘bird’ in the original dataset.
Class 3 is estimated to contain ~17.82% of the population. Two of its attributes (‘breathes’ and ‘predator’) are both above 50% on the graph, but because we limited our initial filtering to attributes at ~80% and above, they are not selected in the filter.
Members of this class are predicted to have ‘eggs’ and ‘has_legs’. While this limited selection duplicates a few of the members already selected for other classes, this class also contains the invertebrates and insects, which were not present in the other classes. Based solely on comparison with the type attribute in the original dataset, this class appears to be the least homogeneous.
Class 4, which is estimated to contain ~13.09% of the population, is less homogeneous than Classes 1 and 2, but not as heterogeneous as Class 3. The original classification schema of “type” had 7 unique values, which implies that with a 5-class model, overlap is inevitable.
This overlap is visible only because we have the original “type” data for illustration purposes, which is unlikely to be the case in research data.
Members of this class are predicted to have ‘aquatic’, ‘predator’, ‘toothed’, ‘backbone’, ‘breathes’, ‘tail’. We filtered the cleaned new_zoo dataset for these attributes. The filter yielded only 5 members, which are mostly aquatic mammals and a single amphibian.
Class 5, the smallest class, is estimated to contain ~12.85% of the population, and its members are predicted to have ‘eggs’, ‘aquatic’, ‘toothed’, ‘backbone’, ‘fins’, ‘tail’. We filtered the cleaned new_zoo dataset for these attributes, and noted that all members were classified as type ‘fish’ in the original dataset.
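The filtering approach used for each class above can be sketched as follows. The conditional probabilities in `class_probs` and the four-row `toy` data frame are invented for illustration; only the ~80% cutoff comes from our analysis.

```r
# Hypothetical class-conditional "yes" probabilities for one class
class_probs <- c(hair = 0.98, milk = 0.95, eggs = 0.03,
                 breathes = 0.99, predator = 0.55)

# Keep the attributes the class predicts at ~80% or above
threshold <- 0.80
defining <- names(class_probs)[class_probs >= threshold]
defining   # 'predator' (0.55) falls below the cutoff and is left out

# Hypothetical data with the held-out 'type' label
toy <- data.frame(
  animal   = c("bear", "duck", "bass", "deer"),
  hair     = c(1, 0, 0, 1),
  milk     = c(1, 0, 0, 1),
  eggs     = c(0, 1, 1, 0),
  breathes = c(1, 1, 0, 1),
  predator = c(1, 0, 1, 0),
  type     = c("mammal", "bird", "fish", "mammal")
)

# Select the rows that have every defining attribute
has_all <- rowSums(toy[, defining, drop = FALSE] == 1) == length(defining)
matches <- toy[has_all, ]

# Compare the selected rows with the original 'type' labels
table(matches$type)
```

A lower threshold pulls in more attributes and therefore selects fewer rows, so the cutoff directly affects how homogeneous each class appears.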
library(FactoMineR)  # MCA()
library(factoextra)  # fviz_mca_biplot()
library(ggplot2)     # ggplot() and related layers
# Convert new_zoo_int_1 to factors so it can be used with the MCA() function
new_zoo_int_1_factor <- as.data.frame(lapply(new_zoo_int_1, factor))
# Save the result of MCA (Multiple Correspondence Analysis) applied to new_zoo_int_1_factor as mca_outcome
# Wrapped in invisible() so that the output does not print in place of the labeled image
mca_outcome <- invisible(MCA(new_zoo_int_1_factor, graph = FALSE))
# Number of dimensions shown in the biplot (x and y axes only)
n_dimensions <- 2
# Create the MCA biplot from mca_outcome
mca_biplot <- fviz_mca_biplot(
  mca_outcome,
  repel = TRUE,
  axes = c(1, 2),  # display only axes 1 (Dim1) and 2 (Dim2)
  title = paste("MCA Biplot (Dimensions:", n_dimensions, ")"))
# Extract the class membership probabilities and convert to dataframe
class_probability <- lca_fit5$posterior
class_probability_df <- as.data.frame(class_probability)
# Build a matrix from the dataframe and assign each observation -> highest probability LCA class
class_probability_matrix <- as.matrix(class_probability_df)
lca_result_classes <- max.col(class_probability_matrix, ties.method = "random")
# Turn lca_result_classes -> factor for MCA input
lca_result_class_factor <- factor(lca_result_classes, levels = 1:5, labels = c("Class 1", "Class 2", "Class 3", "Class 4", "Class 5"))
# Find the coordinate points generated by the Multiple Correspondence Analysis (MCA) function
coord_pts <- as.data.frame(mca_outcome$ind$coord)
# Adding the LCA class labels -> dataframe with the coordinates
coord_pts$lca_result_class_factor <- lca_result_class_factor
# Plot the LCA classes as ellipses over the MCA plot -> LCA_MCA_biplot
LCA_MCA_biplot <-
  ggplot(coord_pts, aes(x = `Dim 1`, y = `Dim 2`, color = lca_result_class_factor)) +
  geom_point(size = 3, alpha = 0.8) +
  stat_ellipse(aes(x = `Dim 1`, y = `Dim 2`, color = lca_result_class_factor), type = "norm", level = 0.95) +
  geom_text(aes(label = rownames(coord_pts)), nudge_x = 0.02, nudge_y = 0.02, size = 3, check_overlap = TRUE, color = "black") +
  labs(
    title = "MCA Biplot with LCA Classes",
    x = "Dimension 1 (31.2% Variance Explained by this Dimension)",
    y = "Dimension 2 (25.3% Variance Explained by this Dimension)",
    color = "LCA Class")
LCA_MCA_biplot
As seen in the correlation heatmap figure below, ‘milk’ is highly positively correlated with ‘hair’ and highly negatively correlated with ‘eggs’. We decided to remove ‘milk’ as an attribute and refit our LCA model to see whether it improves once this highly collinear variable is excluded.
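A minimal sketch of the kind of correlation check behind the heatmap, using base R on invented data. The eight-row `toy` data frame and the 0.8 flagging cutoff are assumptions for illustration; for 0/1 variables, the Pearson correlation returned by `cor()` is the phi coefficient.

```r
# Hypothetical binary data in which 'eggs' is the exact opposite of 'milk'
toy <- data.frame(
  milk = c(1, 1, 1, 0, 0, 0, 0, 0),
  hair = c(1, 1, 1, 0, 0, 0, 1, 0),
  eggs = c(0, 0, 0, 1, 1, 1, 1, 1)
)

# Pairwise correlations between the binary attributes
corr <- cor(toy)
round(corr, 2)

# Flag each pair once (upper triangle) when |r| exceeds the assumed cutoff
high <- which(abs(corr) > 0.8 & upper.tri(corr), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high]
)
flagged
```

LCA assumes the variables are independent within each class, so a variable that is nearly a copy (or mirror image) of another adds little information and is a natural candidate for removal.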
# Create a 13-variable dataset without 'milk'
new_zoo_int_2 <- new_zoo_int_1 %>% mutate(milk = NULL)
# Bind the 13 variables into columns for the modified LCA model without 'milk'
lca_bind_corr <- cbind(hair, feathers, eggs, airborne, aquatic, predator,
                       toothed, backbone, breathes, venomous, fins, tail, has_legs) ~ 1
lca_fit5_corr <- poLCA(lca_bind_corr, data = new_zoo_int_2,
                       nclass = 5, graphs = FALSE, na.rm = TRUE,
                       verbose = FALSE, nrep = 100, maxiter = 100)
lca_fit5_corr.ent <- poLCA.entropy(lca_fit5_corr)
# Create a data frame of AIC, BIC, GoF, and entropy for the original and modified 5-class models
class5_compare <- data.frame(
  Model = c('Original 5-Class Model', 'Adjusted 5-Class Model'),
  AIC = c(lca_fit5$aic, lca_fit5_corr$aic),
  BIC = c(lca_fit5$bic, lca_fit5_corr$bic),
  GoF = c(lca_fit5$Chisq, lca_fit5_corr$Chisq),
  entropy = c(lca_fit5.ent, lca_fit5_corr.ent))
Given the improvement in all fit statistics except entropy for the 5-class model with the highly correlated variable ‘milk’ removed, we decided to remove ‘milk’ from our final model, leaving 13 attributes for classification by Latent Class Analysis.
Class 3 and Class 4 show the largest changes in population proportion compared with the 5-class model that still includes ‘milk’. Class 3 sees about a 1% drop in membership, and Class 4 sees a ~1.4% increase. Because these two classes were our least homogeneous, it makes sense that they would see the greatest changes in proportion.
Benefits of LCA
Challenges to Consider